Nonromanization: Prospects for Improving
Automated Cataloging of Items in Other
Writing Systems

SUMMARY 

This paper describes the dilemma of cataloging works in other writing
systems, outlines some characteristics of these writing systems, discusses
the implications of these characteristics for the input, retrieval, sorting
and display needed for adequate online catalogs of such works, suggests some
reasons these needs have not been met, and explores some ways they might be
met, e.g., Unicode.¹

INTRODUCTION

The people of the world write and read documents in many systems other
than the roman alphabet. Libraries acquire documents in nonroman scripts so
readers can study and better understand these people. At LC over a third of
current book cataloging is for nonroman items. To organize and service these
documents librarians use romanization because the resulting records are easy
to interfile with ones for roman alphabet documents, thus creating a catalog
of an entire collection in a single A to Z sequence. Readers of nonroman
documents, on the other hand, want to see the original script because it is
more familiar to them than romanized versions of text for authors, titles,
etc. Few of us would recognize our names rendered in Arabic or Devanagari
script (Figure 1), but we routinely expect those seeking books in such
scripts to use romanized versions of headings for the works they want.²

¹ This paper is an updated version of a talk given on July 22, 1991 at a
meeting of the Library of Congress' Cataloging Forum. The opinions expressed
in it are purely personal, not a commitment by ITS to develop such systems.

² Those wanting to explore more fully the adequacy of romanization for
bibliographic control should consult two articles: C. Sumner Spalding,
"Romanization Reexamined," Library Resources & Technical Services 22, no.1
(Winter 1977): 3-12 and Hans H. Wellisch, "Multiscript and Multilingual
Bibliographic Control: Alternatives to Romanization," Library Resources &
Technical Services 22, no.2 (Spring 1978): 179-90.





Figure 1: "James Agenbroad" in Arabic and Devanagari



To accommodate the wants of librarians and readers, the cataloging rules
provide for giving headings in the roman alphabet but descriptive elements
in their original script whenever possible. In other words, in the card
catalog era, if readers could guess how librarians romanized the heading for
the author or title they sought, they could find the card with the original
script, which they could then read. (The need to help readers understand our
romanization schemes could partially account for a need for more reference
librarians in divisions that deal with these scripts than in divisions that
handle only roman alphabet items.) To further assist readers for whom
romanized headings are unclear, the group that approves changes to the MARC
formats, the ALA Interdivisional Committee on Machine-Readable Bibliographic
Information (MARBI), has added provisions in the bibliographic and authority
formats to allow, respectively, headings in other writing systems and cross
references from such headings.

The LC Information Bulletin for April 13, 1979 states: "The Library
reiterates that it is still firmly committed to a long-range policy of
inputting machine-readable bibliographic records, in a combination of
nonroman and roman characters, in line with the present manual approach."

The two major bibliographic utilities, the Online Computer Library Center
(OCLC) and the Research Libraries Group (RLG), have invested considerable
resources and have had commensurate success in this area. OCLC allows input,
storage and display of Chinese, Japanese and Korean. RLG's Research
Libraries Information Network (RLIN) handles these plus Cyrillic, Hebrew and
Arabic. LC uses RLIN for cataloging books in Chinese, Japanese, Korean,
Hebrew and Arabic. LC now uses OCLC for creating MARC records with the
original script for Chinese, Japanese and Korean serials. Unfortunately few
readers are authorized and trained to search for nonroman documents on the
bibliographic utilities. If they were, it would be interesting to learn
their reaction to searching original script headings, which the cataloging
rules do not prescribe but which MARC allows. As the use of the Internet and
LC Direct becomes widespread, readers of nonroman documents may want to
search for them from a terminal in their office and then see the original
script of at least the bibliographic record there.

The bulk of this paper categorizes nonroman writing systems into four
groups and discusses features of each that have implications for the
automation of cataloging works in each group (Table 1). The four groups
with their chief distinguishing characteristics are: European (upper/lower
case); Semitic (read right to left); Indic (implicit vowel); and Han, i.e.,
Chinese (very large character repertoire). (By omitting Georgian and
Amharic, this taxonomy and the table oversimplify the situation.)

Key to Table

Reversible romanization: an estimate of how well software could derive the
original script from text data romanized according to the LC scheme for a
particular language; 0 = useless, 9 = accurate. Upper/lower case: indicates
which scripts make this distinction. Known sort order: indicates scripts
with an "alphabetic" order familiar to all their readers. Initial article:
indicates whether the languages use articles before nouns or adjectives
which need to be ignored for filing, and possibly for keyword searching when
they are written as part of the word, as in Semitic languages and elisions,
e.g., l'histoire. Word space: indicates whether or not the languages
separate words with spaces. Those that do pose fewer problems for
romanization and keyword searching. Inflected: indicates languages that
often alter words to show grammatical categories: singular/plural,
nominative/genitive, past/present, etc. Direction: indicates languages read
from left to right or right to left. It excludes Mongolian in vertical
script. Context sensitive: indicates scripts whose letters vary visually
depending on their environment. Diacritics: indicates scripts whose letters
may have marks superimposed above, beneath or beside them. Hindi, etc.:
includes the following scripts with similar characteristics: Tibetan,
Gurmukhi, Gujarati, Bengali, Oriya, Telugu, Kannada, Malayalam and
Sinhalese.

Table 1: Script groups and some characteristics affecting their automation

It is important to note that just as an online catalog for items in our
own alphabet requires more elaborate retrieval and sort capabilities than
typical word processing software provides, an effective online catalog for
items in other writing systems also requires more than the mere display of
the elements of a writing system such as a Russian, Hindi or Japanese word
processing program would provide. Though LC's Hebraic Section has a Hebrew
script title card catalog whose sorting begins with א (aleph), sorting on
nonroman characters is not required by AACR2. Some users of MARC records do
not have the hardware needed to display nonroman writing systems. To give
them some access to records containing nonroman text, the MARC format calls
for also giving parallel romanized versions of all text given in other
writing systems, not just the headings. Some LC romanization schemes are
nearly reversible by computer programs, so the feasibility of generating
provisional versions of the needed parallel fields will also be considered.
Since several languages often use the same script while most of our
romanization tables convert specific languages to our alphabet, informing
the computer of the language of a nonroman text string would improve the
performance of such software.³
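
The idea of deriving a provisional original-script field from romanized text can be sketched as follows. This is only an illustration under stated assumptions, not LC software: the table ROMAN_TO_CYRILLIC is a small hypothetical subset covering a few lowercase Russian letters, and the function name provisional_cyrillic is invented for the example. It tries two-letter romanizations such as "zh" before single letters.

```python
# Hypothetical subset of a Russian romanization table, reversed so that
# romanized text maps back to Cyrillic. Lowercase only, for illustration.
ROMAN_TO_CYRILLIC = {
    "zh": "ж", "sh": "ш", "ch": "ч", "kh": "х",
    "a": "а", "k": "к", "l": "л", "n": "н",
    "o": "о", "r": "р", "u": "у", "z": "з",
}

# Longest keys first, so "zh" is matched before "z".
_KEYS = sorted(ROMAN_TO_CYRILLIC, key=len, reverse=True)


def provisional_cyrillic(romanized: str) -> str:
    """Return a provisional Cyrillic string for cataloger review."""
    out = []
    i = 0
    while i < len(romanized):
        for key in _KEYS:
            if romanized.startswith(key, i):
                out.append(ROMAN_TO_CYRILLIC[key])
                i += len(key)
                break
        else:
            # Unknown character: pass it through for the cataloger to resolve.
            out.append(romanized[i])
            i += 1
    return "".join(out)


print(provisional_cyrillic("zhurnal"))
print(provisional_cyrillic("shkola"))
```

A table of this kind is specific to one language, which is why the language of the field needs to be known before conversion; the output is provisional and would still need review.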

I exclude some writing systems with minimal relevance to cataloging at LC:
Mongolian in vertical script, Eskimo and Cree in Evans' syllabary, Syriac,
Coptic, Cherokee, unscheduled languages of India with their own scripts,
Chinese minority, i.e., non-Han, languages with their own scripts,
Maldivian, traditional scripts of Indonesia and the Philippines, and extinct
writing systems (deciphered or not) such as cuneiform, hieroglyphs, Indus,
Easter Island, Mayan, Kharoshthi and various Central Asian scripts.



EUROPEAN SCRIPTS 



This group contains scripts which distinguish between capital and
lowercase letters: Greek, Cyrillic and Armenian. As it does for roman, this
distinction complicates input and must be ignored during retrieval and
sorting. The fewer the languages that use a script, the easier it is to
define the sequence of letters for sorting. This means defining the
alphabetic order of letters for Greek and Armenian presents few problems.
Cyrillic script, on the other hand, is used not only with several Slavic
languages of Europe (Russian, Serbian, Ukrainian, Bulgarian) but also, with
various extra letters and diacritics, to write many Asian languages of the
former Soviet Union, e.g., Uzbek. Still, it is probably possible to include
these special letters in the sequence of Cyrillic letters, as we cope with
Scandinavian letters when sorting roman letters. Greek has initial articles
that must be ignored; the others do not. Greek is mildly context sensitive:
one letter, sigma, appears differently at the end of a word. If the final
sigma is separately keyed and stored with its own code, it may need to be
normalized for filing. If, instead, a single code for lowercase sigma is
used, the output software (printing and terminal displays) must look ahead
to determine which form is wanted. Otherwise display of these scripts is no
harder than display of the roman alphabet. In inflected languages words
change to show number, gender, case, etc. English is slightly inflected, so
when using the FIND command for keyword searching one must seek the singular
and plural forms of nouns. Several of these languages are quite inflected,
so keyword searching as implemented in MUMS would be less effective. For
example, if nouns in a language have four cases (nominative, genitive,
dative and accusative) and two numbers (singular and plural), one would need
to search for eight (4 x 2) forms of each noun. Writing software to generate
provisional versions of romanized fields from the original script for
cataloger review is probably worth exploring for Greek and the major Slavic
languages that use Cyrillic, assuming the language code is present.

³ A recent article on stemming, i.e., reducing words to their uninflected
root forms, demonstrates the importance of knowing the language of the text
being processed: Mirko Popovic and Peter Willett, "The Effectiveness of
Stemming for Natural-Language Access to Slovene Textual Data," JASIS,
Journal of the American Society for Information Science 43, no.5 (June
1992): 384-90.
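
The two ways of handling sigma in the European scripts section, together with the need to ignore case when filing, might be sketched like this. The function names greek_filing_key and display_form are hypothetical, and the sketch assumes unaccented Greek text; it illustrates the idea and is not a filing rule.

```python
import re

SIGMA = "σ"        # ordinary lowercase sigma
FINAL_SIGMA = "ς"  # form used at the end of a word


def greek_filing_key(heading: str) -> str:
    """Ignore case and treat final sigma as ordinary sigma for filing."""
    return heading.lower().replace(FINAL_SIGMA, SIGMA)


def display_form(stored: str) -> str:
    """Given text stored with a single sigma code, look ahead to the end
    of each word and substitute the final form there for display."""
    return re.sub(r"(?<=\w)σ\b", FINAL_SIGMA, stored)


print(greek_filing_key("ΟΔΟΣ"))
print(display_form("οδοσ σοφοσ"))
```

Either strategy works so long as filing and display agree on which form is stored; the sketch keeps the stored form free of the final sigma and leaves the choice of glyph to the output step.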

SEMITIC SCRIPTS 

This group contains Hebrew and Arabic scripts. Hebrew is used with a few
other languages, mainly Yiddish; Arabic is used with many languages
including Persian, Urdu, Pushto, Tajik, Sindhi, Kashmiri, Uighur and Malay.
As with the roman and Cyrillic scripts, there are extra letters and
diacritics for languages other than Arabic. Not just titles but also Hebrew
and Arabic personal names have initial articles which must be ignored in
sorting. Articles are not written separately (like the French word
"l'histoire"), which makes keyword retrieval more difficult. Many vowels are
seldom written and should probably be ignored for sorting. Current LC
romanization schemes call for supplying the vowels, which is quite labor
intensive. This means software that generates provisional parallel fields
for catalogers to review could probably only generate the original script
from the romanized form (rather than vice versa), since the computer could
not predict the vowels. For automation the chief difficulty is that these
languages are written and read from right to left. This poses major problems
for transmission, sorting and display. Though it appears at the right
margin, the first letter of an Arabic or Hebrew title is wanted first in the
245 field of the MARC record so an effective title key (PTK) can be built.
This is also important for sorting. Unlike letters, numbers are written and
read in the same direction as they are in roman titles; this complicates
keying, transmitting, storing, sorting and displaying a Semitic title
similar to "76 trombones". The need to combine in a single field Semitic and
left to right text strings (e.g., the title of a Hebrew/Russian dictionary)
makes matters even more difficult. Like Greek, Hebrew is slightly context
sensitive: five letters have a separate final form to be dealt with.
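
The difference between the order in which a right to left title is stored (first letter read comes first) and the order in which it appears on a line can be sketched as below. This is a deliberately naive illustration, assuming a single line that contains only right to left letters, spaces and digits; the function name visual_order is hypothetical, and a real system would need a full bidirectional algorithm.

```python
import re


def visual_order(logical: str) -> str:
    """Return the characters as they appear left to right on the line.

    The stored (logical) string is mirrored so the first letter lands at
    the right margin, then each run of digits is turned back around
    because numbers still read left to right.
    """
    mirrored = logical[::-1]
    return re.sub(r"\d+", lambda m: m.group()[::-1], mirrored)


# Three Hebrew letters (alef, bet, gimel), a space, then the number 76.
stored = "\u05d0\u05d1\u05d2 76"
print(visual_order(stored))
```

Keeping the stored form in reading order is what lets a title key or sort key be built from the start of the field; the reversal is left entirely to the display step.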

Arabic is very context sensitive: all but a few letters appear in four
forms depending on their position in a word: initial, middle, final or with
a space on both sides. For example, the letter ba alone is ب; at the end,
middle and start of a word it appears as ـب, ـبـ and بـ respectively. In most
modern Arabic text computer systems (including RLG's) one keys a single
letter (regardless of its position) which is stored with a single code. Then
the display software determines and generates the appropriate visual form.
Many special letter combinations analogous to roman ligatures such as fi and
ffl are desirable for high class Arabic typography, but it is mandatory to
use the lam-alif combination whenever these letters occur together. If,
however, this combination is stored with its own code, sorting software must
expand it. While Urdu uses the Arabic script, instead of a linear right to
left sequence it uses the nastaliq style, in which words and phrases usually
appear diagonally. When cards for Urdu items displayed Urdu they used
horizontal Arabic type, not nastaliq, so perhaps the online catalog need not
use nastaliq either. At least one Central Asian country formerly part of the
Soviet Union, Tajikistan, again allows printing in an expanded version of
Arabic.
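
How display software might pick the visual form of a letter that is stored with a single code can be sketched as follows. This is a toy illustration with assumptions: it covers only three letters (beh, lam and alef), handles a single word with no spaces or diacritics, does not form the lam-alif combination, and uses the positional glyphs in Unicode's Arabic presentation forms as stand-ins for whatever a real display device would draw. The names JOINS_TO_NEXT and shape are invented for the example.

```python
import unicodedata

# True if the letter also connects to the letter that follows it.
# Alef connects only to the letter before it.
JOINS_TO_NEXT = {"BEH": True, "LAM": True, "ALEF": False}


def shape(word: str) -> str:
    """Map each stored letter to its isolated, initial, medial or final glyph."""
    names = [unicodedata.name(ch).replace("ARABIC LETTER ", "") for ch in word]
    shaped = []
    for i, name in enumerate(names):
        joined_before = i > 0 and JOINS_TO_NEXT[names[i - 1]]
        joined_after = i + 1 < len(names) and JOINS_TO_NEXT[name]
        if joined_before and joined_after:
            position = "MEDIAL"
        elif joined_before:
            position = "FINAL"
        elif joined_after:
            position = "INITIAL"
        else:
            position = "ISOLATED"
        shaped.append(unicodedata.lookup(f"ARABIC LETTER {name} {position} FORM"))
    return "".join(shaped)


# beh + alef + beh: the last beh stands alone because alef does not join forward.
print([unicodedata.name(ch) for ch in shape("\u0628\u0627\u0628")])
```

Because the stored text holds only one code per letter, sorting and retrieval never see the positional variants; only the output step does.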



INDIC SCRIPTS

By Indic scripts I mean the indigenous scripts of India and Nepal:
Devanagari (for Hindi, Marathi and Nepali), Gurmukhi (for Panjabi),
Gujarati, Bengali, Oriya, Telugu, Kannada, Tamil and Malayalam, and the
related scripts used in Tibet, Sri Lanka and Southeast Asia (Burmese, Thai,
Lao, Khmer and Javanese in Kawi script).

My knowledge is largely limited to the scripts used in India. While these
scripts look very different, in almost all cases they share the following
characteristics: 1. Alphabetic order: the vowels come first, followed by
consonants from K, produced at the back of the throat, to M, produced with
the lips. 2. The most common vowel sound, "a", is implicit in consonants and
is not written unless it begins a syllable. 3. Except at the start of
syllables, other vowels are written as modifiers of consonants (above,
below, or on one or both sides of the consonant), where they override the
implicit vowel. 4. When a consonant has no vowel because it is pronounced
together with one or more following consonants (e.g., "st"), the consonants
are written in a fused form called a conjunct consonant. For example, in
Devanagari script, which is probably the most widely used alphabet of South
Asian origin, Sa = स and Ta = त but Sta = स्त. (Figure 2 shows vowel
modifiers and conjunct consonants for the word "moonlight" in Hindi, Tamil
and Malayalam.)




Figure 2: The word "candrika" in Hindi, Tamil and Malayalam

In India, though words can be quite long, they are written with spaces
between them. In Southeast Asian scripts spaces do not separate words.
Keyword extraction and retrieval will be difficult for languages that do not
use spaces. Some of the languages using these scripts are highly inflected;
like the need to request both the singular and plural forms of English nouns
with the FIND command, these inflected forms make keyword retrieval more
difficult. Keying is not particularly difficult. So long as there is a means
to indicate the absence of a vowel, display programs, though complex, can be
devised to cope with vowel indicators and conjunct consonants; the Indians
have written software to do so. The order of these alphabets presents few
problems for sorting programs. One point may prove difficult: certain
consonantal sounds follow vowels, and by Indian filing tradition a syllable
containing one of them precedes the same letters without it. The
romanization schemes LC uses are sufficiently reversible to make generation
of provisional versions of romanized text worth exploring, at least for
languages that use spaces.
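
The shared Indic characteristics (an implicit "a", vowel signs that override it, and a sign marking the absence of a vowel inside conjuncts) are what make provisional romanization feasible. The following is a minimal sketch for Devanagari, assuming text stored as in Unicode with one code per consonant, vowel sign and vowel-absence sign (virama). The tables cover only the handful of letters needed for the example, and the function name romanize is hypothetical; it is not the LC romanization table.

```python
VIRAMA = "\u094d"  # sign marking that a consonant has no vowel

# Tiny illustrative subsets, not complete tables.
CONSONANTS = {
    "\u0915": "k",  # ka
    "\u091a": "c",  # ca
    "\u0924": "t",  # ta
    "\u0926": "d",  # da
    "\u0928": "n",  # na
    "\u0930": "r",  # ra
    "\u0938": "s",  # sa
}
VOWEL_SIGNS = {
    "\u093e": "ā",  # sign for long a
    "\u093f": "i",  # sign for short i (drawn to the left, stored after)
}


def romanize(text: str) -> str:
    """Romanize a run of consonants, vowel signs and viramas."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch not in CONSONANTS:
            raise ValueError(f"letter not in the illustrative table: {ch!r}")
        out.append(CONSONANTS[ch])
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == VIRAMA:
            i += 2              # no vowel: part of a conjunct
        elif nxt in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[nxt])
            i += 2              # written vowel overrides the implicit "a"
        else:
            out.append("a")     # implicit vowel
            i += 1
    return "".join(out)


# ca, na+virama, da+virama, ra+i, ka+ā: one spelling of "candrika"
word = "\u091a\u0928\u094d\u0926\u094d\u0930\u093f\u0915\u093e"
print(romanize(word))
```

The same tables read in the other direction suggest why these schemes are largely reversible: each romanized consonant and vowel corresponds to one stored code, with the virama supplied wherever two consonants meet.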

EAST ASIAN SCRIPTS 

Unlike the previously discussed writing systems, which use fewer than a
hundred "letters" assigned to components of the sound system of a particular
language, Chinese is written with thousands of different characters which
more nearly represent either the idea of a word or its idea and its sound.
Japanese uses these characters (calling them kanji) and about forty other
characters (called kana) that represent sounds much as the roman alphabet
does. Similarly, South Koreans write with a mixture of Chinese characters
(called hanja) and syllabic characters (called hangul). Hangul syllables are
built from separate elements for the constituent vowel and consonants, which
is somewhat analogous to building syllables in Indic scripts. In North Korea
only hangul are used. For purposes of automation Japanese kana and Korean
hangul pose no new difficulties: they are few in number and have a known
sequence for sorting.⁴ All three languages are written without spaces, which
makes keyword indexing and retrieval difficult. The existence of traditional
and simplified forms of many Chinese characters, which must be displayed
differently but treated as the same for retrieval and sorting purposes,
further complicates matters. Procedures for assigning word boundaries for
romanized texts are complex and time consuming. Keyword access will be
ineffective unless a searcher's notion of what constitutes a word matches
the cataloger's.
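
Treating traditional and simplified forms as the same for retrieval while displaying them differently could be approached by indexing a folded key and keeping the original text for display. The sketch below assumes a lookup table from traditional to simplified forms; the three-entry table TRAD_TO_SIMP and the function name retrieval_key are illustrative only.

```python
# Illustrative three-character table; a real table would be far larger.
TRAD_TO_SIMP = str.maketrans({"圖": "图", "書": "书", "館": "馆"})


def retrieval_key(text: str) -> str:
    """Fold traditional characters to simplified ones for matching only.

    The original text is kept unchanged for display; only the key used by
    the index is folded.
    """
    return text.translate(TRAD_TO_SIMP)


traditional = "圖書館"
simplified = "图书馆"
print(retrieval_key(traditional) == retrieval_key(simplified))
```

Folding toward the simplified form is a design choice made here because several traditional characters can share one simplified form (for example 發 and 髮 both correspond to 发), so the mapping is not one to one in the other direction.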

It is the sheer number of their characters that makes these writing
systems challenging both to readers and computers. There are far too many to
fit onto a single keyboard, so various input schemes exist. Typically they
involve keying an approximation of a character (its shape, its sound, its
strokes or some combination of them) and then selecting the desired
character from a menu of those that match the approximation. Because there
are so many characters, there is no one widely accepted collating sequence
for them analogous to our A-Z alphabetic order. Instead, there are many
different schemes for sequencing Chinese characters. The Japanese and
Koreans generally sort their characters by the accepted order of their
sounds as represented in kana and hangul respectively. It would be possible
to store the kana and hangul equivalents for sorting. For Chinese and
Korean, generation of provisional romanized equivalents might be possible.
For Japanese, doing so is less promising because many kanji have two
pronunciations. Because Chinese characters are very intricate, their display
at terminals or printers requires higher resolution devices, which also cost
more. Their number and higher resolution requirements mean more storage.
While the number of Chinese characters is finite, it is not fixed, so a
method is needed to add characters occasionally to the input and output
devices.

⁴ I recently learned that North and South Korea use different sort
sequences for hangul but there is a proposal to unify them. For details see
Kyongsok Kim, "A Future Direction in Standardizing International Character
Codes, with Special Reference to ISO/IEC 10646 and Unicode," Computer
Standards & Interfaces 14, no.3 (May 1992): 209-21.
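
Storing a kana equivalent alongside each heading and sorting on it, as suggested above for Japanese, can be sketched like this. The record layout (a "title" and a "reading" key) and the function name sort_by_reading are hypothetical, and sorting hiragana by code value only approximates the accepted kana order; voiced and small kana would need further rules.

```python
# Each record keeps the heading as written plus a stored kana reading.
records = [
    {"title": "東京", "reading": "とうきょう"},
    {"title": "大阪", "reading": "おおさか"},
    {"title": "京都", "reading": "きょうと"},
]


def sort_by_reading(recs):
    """Sort on the stored kana reading rather than on the characters shown."""
    return sorted(recs, key=lambda rec: rec["reading"])


for rec in sort_by_reading(records):
    print(rec["reading"], rec["title"])
```

The reading has to be supplied by a person or a dictionary at input time, since, as noted above, many kanji have more than one pronunciation.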

CONCLUSION

This paper has not listed every detail of every writing system found in
works LC catalogs. A few other factors must be mentioned. First, for reasons
of widest possible utility the MARC format is by intention independent of a
single hardware or software vendor's offerings. This has consequences for
costs and speed of development. If LC could go it alone we would be further
along than we are. Second, while LC acquires many materials written in
nonroman scripts, their users are far from a united and vocal audience. If
they were, we would have made more effort to satisfy their needs. Elsewhere
work has been done with automation of virtually every script mentioned (and
even some of those I excluded). Until recently this work has usually
involved roman and one other writing system; continuing the Library's
integrated catalog, on the other hand, requires a many-scripts approach.

Fortunately the prospect of global markets has made the computer industry
broaden its perspective. We now have on the horizon the beginnings of an
all-scripts approach which comes closer to the Library's needs. This has
resulted in the Unicode and the ISO 10646 efforts to define an integrated
character set standard for all writing systems. If terminal and printer
vendors implement this character set and if MARBI and LC adopt it too, we
could make our online catalog as legible and effective for readers as the
card catalog was for finding works in other scripts. It could be even more
effective if we create headings in the original script.

In the following pages (not part of my talk) I discuss some ways we could
use Unicode in MARC to let us realize such improvements. We should be able
to select and implement an approach that would free us from the input,
storage and display aspects of nonroman scripts so we can concentrate on the
nonroman retrieval and sorting issues. The basic problem will soon be
political, not technical: given our limited resources, what priority does
effective catalog access and display for works in nonroman scripts have? Can
those who want improvements in access to materials in other scripts raise
their priority?

POSSIBLE ROLE OF UNICODE IN MARC

The preceding pages have briefly described features of various nonroman
writing systems that must be dealt with to improve the cataloging functions
for works in these writing systems. After a short explanation of MARC and
Unicode, this section examines some ways MARC might use Unicode.

A well known byproduct of the excellence needed in a catalog of a
collection as vast as the Library's has been the acceptance of LC cataloging
by other libraries. Since the late sixties the medium of distribution for
this cataloging has increasingly been the MARC format. This format defines a
record structure and the means for identifying the elements of a
bibliographic record so others can use the data for their needs. This format
also includes a character set, "the ALA character set," which was
revolutionary when it was introduced because it specified codes for many
special characters (e.g., Æ, Ł and £) and diacritics (illustrated in the
original with marks on the letter x) needed to transcribe accurately titles
in foreign languages that use our alphabet. (A character set is a repertoire
of letters, punctuation, numerals, diacritics, etc. and the unique computer
code assigned to each.) More recently character sets for the Cyrillic,
Hebrew and Arabic alphabets and one for Chinese, Japanese and Korean
characters have been added to the MARC format definition, but these
characters have not been implemented on systems maintained at LC. Besides
the reasons already mentioned, these character sets have not been
implemented because it would be expensive to do so.

Unicode is an effort to define a character set that includes the letters,
characters, punctuation, etc. for all the world's text writing systems. (It
does not cover pictorial matter, color or musical notation, but cataloging
does not require them.) Unicode will probably become an international
standard, ISO 10646, late in 1992. Software and terminal vendors will then
begin to implement it in their products to facilitate sales to foreign and
multinational customers who need to communicate widely. I expect that
Unicode will be as revolutionary as the "ALA character set" once was. When
terminals with Unicode become commercially available they will reduce the
cost of implementing the improvements described above, but only if MARC
adopts Unicode.

Three features of Unicode must be kept in mind. First, at present it does
not contain a few characters in the ALA set, mainly the ligature used in
romanization, e.g., t͡s, and the double-width tilde, e.g., n͠g, which is used
very seldom. This could be solved either by getting them restored to Unicode
(they were in a draft) or by adding them in the private use space. The
former is preferable. Second, it uses 16 bits per character instead of 8.
(This is how it gets enough different codes for so many characters.) The
approach is roughly analogous to changing braille from six dots to twelve.
It is a major change for anyone who will use Unicode. Third, the code for a
diacritic follows the letter it modifies; in MARC the diacritic comes first.
This is a significant change, but it affects only MARC software, not all
users of Unicode. It could be overcome by a database upgrade that reversed
the sequence of all diacritics and by changing any software that processes
diacritics: not just software for input and display but for retrieval and
sorting as well. Such a conversion would require close coordination with
users of MARC data.
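
The diacritic reordering described as the third feature might look like the sketch below. It is an illustration under an assumption: diacritics are modeled with Unicode combining characters standing in for the actual ALA codes, so the input is text in which each combining mark precedes its letter (the MARC order) and the output places it after (the Unicode order). The function name diacritics_after_base is hypothetical.

```python
import unicodedata


def diacritics_after_base(text: str) -> str:
    """Move each run of combining marks from before its letter to after it."""
    out = []
    pending = []  # marks seen but not yet attached to a letter
    for ch in text:
        if unicodedata.combining(ch):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)
            pending.clear()
    out.extend(pending)  # marks with no following letter are kept at the end
    return "".join(out)


marc_order = "caf\u0301e"  # acute accent stored before the "e"
converted = diacritics_after_base(marc_order)
print(unicodedata.normalize("NFC", converted))
```

Any retrieval or sort routine that strips or compares diacritics would need the same change of assumption, which is why the conversion touches more than input and display.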

The present treatment of Indic scripts in Unicode leaves something to be
desired. The codes for many letters differ from those in the relevant Indian
standard, IS 13194:1991, and they should not. Some Indic scripts display
some vowel signs on two sides of a consonant. Unicode has added an extra
code for the second part of such signs. These are superfluous: they obscure
the shared symmetry that is the hallmark of Indian scripts, and unless
removed they will complicate exchanging software and data with Indian
organizations that follow their standard for their scripts.

Assuming the above are resolved, the Unicode options I can see are: 

1. Do nothing. This would be appropriate if vendors do not implement
Unicode. If they do, this would unnecessarily perpetuate and increase the
separation between bibliographic and other text data processing
applications. It is contrary to the trend toward networking.

2. Define Unicode as the new MARC character set so every character is 16 
bits long. This would render virtually all MARC software obsolete. This 
is as extreme as the first option but in the opposite direction. 

3. Use an escape sequence to invoke Unicode as needed. An escape sequence
announces that a new character set is in effect. This is the technique now
used in MARC to invoke Cyrillic, Arabic and other character sets. Though ISO
has defined an escape sequence for Unicode, registration number 162, vendor
implementations of Unicode may not allow this approach. A Unicode escape
sequence could be adopted by MARC in at least three ways:

a. As the only escape sequence; it would be used whenever the need arose
for a character not in ASCII, the US standard set which assigns codes for
A-Z, a-z, 0-9 and punctuation. Most microcomputers and word processors
already use ASCII.

b. As the only escape sequence, but used only when one needed to exceed
the ALA character set. Unless the diacritics conversion described above were
done, this would result in records with some diacritics before the letter
they modify (ALA) and others after the letter they modify (Unicode). This is
undesirable even if the Unicode data were always in the 880 field where all
nonroman data (Cyrillic, Hebrew, etc.) now resides.

c. As just one more escape sequence, used when one needed to invoke a
character set other than those now in use (i.e., Cyrillic, Arabic, Hebrew
and CJK). Then Unicode would be used just for Greek, Indic and other writing
systems that MARC does not now allow. This would minimize both the economic
and networking advantages of using Unicode.

4. Define fields that would use Unicode exclusively. In these fields each
character would be two bytes long, including the indicators, delimiters,
subfield codes and end of field character. Rather than define new fields,
one could declare that for Unicode data the first character of each tag
was alphabetic, so 0 = A, 1 = B, etc. Then C45 (or possibly c45) would be
the tag of a title field containing Unicode. While this too would result in
records with diacritics before and after the letter they modify in different
fields, the tag would give an early warning.

5. Dual mode distribution could also be considered. Records for which both 
the ALA and Unicode sets were adequate could be made available with 
either the ALA or Unicode character sets at the recipient's option. This 
assumes that Unicode would assign codes to every element of the ALA 
set. It could complicate networking since two versions of many records 
would exist. 
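
The tag convention suggested in option 4 (0 = A, 1 = B, and so on, so that 245 becomes C45) is simple enough to sketch. The function names unicode_tag and numeric_tag are hypothetical; the sketch only shows the mapping of the first tag character and says nothing about how the field contents would be encoded.

```python
LETTERS = "ABCDEFGHIJ"  # position in the string is the digit it replaces


def unicode_tag(tag: str) -> str:
    """Turn a three-digit tag into its alphabetic-first form, e.g. 245 -> C45."""
    if len(tag) != 3 or not tag.isdigit():
        raise ValueError(f"expected a three-digit tag, got {tag!r}")
    return LETTERS[int(tag[0])] + tag[1:]


def numeric_tag(tag: str) -> str:
    """Recover the ordinary tag from the alphabetic-first form (either case)."""
    return str(LETTERS.index(tag[0].upper())) + tag[1:]


print(unicode_tag("245"))
print(numeric_tag("c45"))
```

Software that did not understand the convention would see an unfamiliar tag rather than misread two-byte characters as one-byte data, which is the early warning the option relies on.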

In deciding how MARC will respond to Unicode we must weigh improved
service and reduced dependence on expensive customized devices against the
cost of conversion. Other factors include the risks that inaction would
further isolate libraries from readers and that a subscriber might convert
MARC records to Unicode and market them.

FURTHER READING

This paper covers a very broad topic. The following items may help those
wanting to read more about the use of computers with other writing systems;
they are in chronological order. The list does not pretend to be a
comprehensive bibliography of the topic.

Languages of the World That Can Be Set on 'Monotype' Machines. Compiled
by R.A. Downie. London: 1963. (The Monotype Recorder, v. 42, no. 4) Good on
the variety of scripts, nothing on their automation.

Om Vikas. Use of Non-English Languages in Computers: A Selected
Bibliography. New Delhi: Electronics Commission, Information, Planning &
Analysis Group, 1978. Impressive with 369 entries, though some pertain to
other roman alphabet languages.

Akira Nakanishi. Writing Systems of the World. Rutland, Vt.: Tuttle, 1980.
Similar to the first item.

CALTIS. Pune, India: 1983-85. Papers from three meetings on calligraphy,
lettering and typography of Indian scripts.

Computer Processing of Chinese & Oriental Languages: An International 
Journal of the Chinese Language Computer Society. Montreal: 1983- 

Joseph D. Becker. "Multilingual Word Processing." Scientific American 215,
no.1 (July 1984): 96-109. An excellent introduction.

SESAME Bulletin: Language Automation Worldwide. Harrogate, Eng.: 
1986- A quarterly journal: SESAME stands for Southeast, South Asia, Mid- 
dle East. 

Automated Systems for Access to Multilingual and Multiscript Library
Materials: Problems and Solutions. Edited by Christine Bossmeyer and Stephen
W. Massil. München, New York: K.G. Saur, 1987. (IFLA publications, 38)
Papers from an IFLA pre-conference, Tokyo, August 21-22, 1986.

John Clews. Language Automation Worldwide: The Development of Character
Set Standards. Harrogate: SESAME Computer Projects, 1988. Good on library
and other character sets.

Jack K.T. Huang and Timothy D. Huang. Introduction to Chinese, Japanese
and Korean Computing. Singapore; Teaneck, N.J.: World Scientific, 1989.

Computers and the Arabic Language. Edited by Pierre Mackay. New York:
Hemisphere, 1990.

Randall K. Barry. "The Standards Dilemma of Character Sets." Information
Standards Quarterly 3, no.2 (April 1991): 8-15. On library and other
character set standards.

Kenneth M. Sheldon. "ASCII Goes Global." Byte 17, no.7 (July 1991):
108-15. On the two attempts to standardize computer codes for all writing
systems: Unicode and ISO 10646.

Unicode Consortium. The Unicode Standard: A Worldwide Character Encoding.
Version 1.0. Reading, Mass.: Addison-Wesley, c1991- Volume one covers all
modern scripts except those for China, Japan and Korea, which will appear in
volume two.

Indian Script Code for Information Interchange. New Delhi: Bureau of
Indian Standards, 1991. (IS 13194)

Information Technology. Universal Multiple-Octet Coded Character Set, UCS,
Part 1: Architecture and Basic Multilingual Plane (26 Dec. 1991). "Working
document for ISO/IEC draft international standard 10646-1.2"

Joan M. Aliprand. "Nonroman Scripts in the Bibliographic Environment."
Information Technology and Libraries 11, no.2 (June 1992): 105-19. Ably
covers much the same ground but aimed more toward systems people. Discusses
ways MARC might incorporate a global character set such as Unicode.